An Extended Application of the Fast Multi-Locus Ridge Regression Algorithm in Genome-Wide Association Studies of Categorical Phenotypes

Categorical (binary or ordinal) quantitative traits are widely used to measure counts and resistance in plants. Unlike continuous traits, categorical traits often provide less detailed insight into genetic variation and have a more complex underlying genetic architecture, which presents additional challenges for genome-wide association studies (GWAS). Meanwhile, methods designed for binary or continuous phenotypes are often applied inappropriately to ordinal traits, which leads to a loss of the original phenotype information and of detection power for quantitative trait nucleotides (QTNs). To address these issues, fast multi-locus ridge regression (FastRR), which was originally designed for continuous traits, is used in this study to analyze binary or ordinal traits directly. FastRR comprises three stages (continuous transformation, variable reduction, and parameter estimation), and it handles categorical phenotype data directly rather than introducing link functions or applying methods designed for other data types. A series of simulation studies demonstrates that, compared with four other approaches for continuous, binary, or ordinal traits (logistic regression, FarmCPU, FaST-LMM, and POLMM), FastRR performs better in the detection of small-effect QTNs, the accuracy of estimated effects, and computation speed. We applied FastRR to 14 binary or ordinal phenotypes in a real Arabidopsis dataset and identified 479 significant loci and 76 known genes, at least seven times as many as detected by other algorithms. These findings underscore the potential of FastRR as a useful tool for genome-wide association studies and novel gene mining of binary and ordinal traits.


Introduction
Categorical (binary or ordinal) quantitative traits are important and are widely used to measure counts and resistance in plants and crop cultivars. For example, the presence (1) or absence (0) of rolled leaves, and susceptibility (1) or resistance (0) to pests, are binary phenotypes. The level of leaf serration at 22 °C, ranging from 0 (entire lamina) to 1.5 (sharp/jagged serration) in Arabidopsis thaliana, and the infection type for leaf rust in wheat, with scores of 0~4, are ordinal phenotypic responses. Ordinal traits are represented by an ordered series of numeric values (degree of infection: 0, 2, 3, etc.).
Categorical traits, as special cases of quantitative traits, present a discontinuous distribution of phenotypes, and breeding tests show phenotypic features that cannot be readily explained by strict Mendelian inheritance [1]. Many traits with low heritability have ordered categorical scores, such as susceptibility or resistance to a disease, and they carry less genetic information [2]. Thus, genetic mechanisms of categorical traits are complex [...] treated these binary or ordinal traits as continuous data, and then employed the FarmCPU method. Regardless, as argued previously, neither of these strategies would be appropriate.
Recently, Zhang et al. [22] proposed fast multi-locus ridge regression (FastRR) for continuous phenotypes, which efficiently handles datasets where the number of markers significantly exceeds the sample size, a scenario in which most penalization methods typically struggle. In this paper, we extend FastRR to apply directly to binary and ordinal traits in genome-wide association studies. The algorithm first converts binary or ordinal phenotypes into continuous data by correcting for polygenic background and population structure, rather than using link functions. Then, it screens a small number of potential candidate loci based on correlation to construct a multi-locus model. Finally, it implements parameter estimation using deshrinking ridge regression to identify significant loci associated with the binary or ordinal traits of interest. A series of simulated and real Arabidopsis thaliana data analyses is used to verify the performance of FastRR on categorical phenotypes. Four other existing continuous/binary/ordinal approaches, namely logistic regression [26], FarmCPU [16], FaST-LMM [9], and POLMM [5], are used for comparison. Collectively, this work provides the implementation of an alternative GWAS approach for binary and ordinal phenotypes and ultimately contributes toward identifying the genetic mechanisms of complex traits in plants and crop cultivars.

The Calculation of the Mixed Linear Model
Let $y_i$ ($i = 1, 2, \ldots, n$) be the binary or ordinal phenotype value of the $i$th individual in a sample of size $n$ from a natural population. The MLM can be described as follows:
$$y = W\alpha + G\beta + u + \varepsilon \tag{1}$$
where $y = (y_1, y_2, \cdots, y_n)^T$ is an $n \times 1$ vector of phenotype values; $\alpha$ is a $c \times 1$ vector of fixed effects, including the population structure, principal components, intercept, and so on; $W$ is the corresponding $n \times c$ design matrix for $\alpha$; $G$ is an $n \times 1$ vector of marker genotypes; $\beta \sim N(0, \sigma_\beta^2)$ is the random effect of a putative QTN, and $\sigma_\beta^2$ is its variance; $u \sim MVN(0, \sigma_u^2 K)$ is an $n \times 1$ vector of polygenic effects, $K$ is an $n \times n$ known kinship matrix, and $\sigma_u^2$ is the variance of the polygenic background; $\varepsilon \sim MVN(0, \sigma^2 I_n)$ is an $n \times 1$ vector of residuals, $I_n$ is an $n \times n$ identity matrix, and $\sigma^2$ is the residual variance. $N$ and $MVN$ denote the univariate and multivariate normal distributions, respectively.
As $\beta$ is treated as a random effect, the variance of $y$ in model (1) is as follows:
$$\mathrm{Var}(y) = \sigma_\beta^2 GG^T + \sigma_u^2 K + \sigma^2 I_n = \sigma^2(\lambda_\beta GG^T + \lambda_u K + I_n) \tag{2}$$
where $\lambda_\beta = \sigma_\beta^2/\sigma^2$ and $\lambda_u = \sigma_u^2/\sigma^2$.

Fast Multi-Locus Ridge Regression Algorithm
The FastRR algorithm is a flexible multi-stage approach for GWAS, which simultaneously implements detection and estimation for associated loci. We describe it in the following stages (Figure 1).

Continuous-Transformed Stage
A transformation matrix is generated using the FASTmrEMMA method [19,20] in this stage. The key to solving model (1) is estimating the two ratios of variance components, $\lambda_\beta$ and $\lambda_u$, which is computationally expensive. Polygenic variance is always larger than zero, whereas the variance of the majority of SNPs is zero because these loci are not associated with the trait, which means $\lambda_\beta = 0$. Therefore, we delete $G\beta$ from model (1), estimate $\hat{\lambda}_u$ from the reduced model containing only the polygenic background, and replace $\lambda_u$ with $\hat{\lambda}_u$ in model (3) [19,20], which avoids re-estimating $\lambda_u$ for each single-marker scan. Thus,
$$\mathrm{Var}(y) = \sigma^2(\lambda_\beta GG^T + \hat{\lambda}_u K + I_n) = \sigma^2(\lambda_\beta GG^T + B) \tag{3}$$
where $B = \hat{\lambda}_u K + I_n$. An eigen decomposition of the positive definite matrix $B$ is
$$B = Q\Lambda Q^T \tag{4}$$
in which $Q$ is orthogonal and $\Lambda$ is a diagonal matrix with positive eigenvalues. Let $C = Q\Lambda^{-1/2}Q^T$; model (1) is then changed to
$$y_C = W_C\alpha + G_C\beta + \varepsilon_C \tag{5}$$
where $y_C = Cy$, $W_C = CW$, $G_C = CG$, $\varepsilon_C = C(u + \varepsilon)$, and $\varepsilon_C \sim MVN(0, \sigma^2 I_n)$. Through this step, model (5) transforms binary or ordinal phenotype values into continuous values for subsequent analysis. At the same time, FastRR fully accounts for polygenic background and population structure.
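As a short check of why this transformation leaves independent residuals, the derivation below computes the covariance of the transformed residual $\varepsilon_C = C(u + \varepsilon)$. It is a sketch under two assumptions made for illustration: that $B = \hat{\lambda}_u K + I_n$, as in FASTmrEMMA [19,20], and that $\hat{\lambda}_u$ equals the true ratio $\sigma_u^2/\sigma^2$.

```latex
\begin{aligned}
\mathrm{Var}(u + \varepsilon) &= \sigma_u^2 K + \sigma^2 I_n
  = \sigma^2(\hat{\lambda}_u K + I_n) = \sigma^2 B, \\
\mathrm{Var}(\varepsilon_C) &= C\,\mathrm{Var}(u + \varepsilon)\,C^T
  = \sigma^2\, C B C^T \\
&= \sigma^2\, Q\Lambda^{-1/2}Q^T \; Q\Lambda Q^T \; Q\Lambda^{-1/2}Q^T \\
&= \sigma^2\, Q\Lambda^{-1/2}\Lambda\Lambda^{-1/2}Q^T
  = \sigma^2\, QQ^T = \sigma^2 I_n .
\end{aligned}
```

Because $C$ is symmetric and $Q^TQ = I_n$, the polygenic covariance is absorbed into $C$, so the transformed phenotype can be analyzed with tools that assume uncorrelated residuals.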

Variable Reduction Stage
Numerous studies have illustrated that most quantitative traits are controlled by a small number of genes, including a few genes with large effects and polygenes with small effects [18,27]. Thus, it is important to dissect all significantly associated loci from a large number of markers. Here, we perform a variable reduction phase in FastRR to detect a subset of variables associated with phenotypes, with the aim of reducing the computational complexity of high-dimensional analysis.
We calculate the correlation coefficient between $y_C$ and $G_C$ in model (5) for each marker; the function cor.test in R returns the p-value of the correlation test. The significance threshold was set to a p-value < 0.01 [28], and markers that did not pass it were removed. At the next stage, all retained potential loci are used to construct a reduced multi-locus model. Essentially, this correlation step is similar to single-marker scanning: it accounts for the polygenic background without considering the variance component $\sigma_\beta^2$.
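The screening step can be illustrated with a minimal Python sketch. It is not the authors' implementation: R's cor.test uses an exact t-test, whereas this sketch swaps in the Fisher z approximation so that it needs only the standard library. The function names and the toy data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:          # a constant vector carries no signal
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def correlation_pvalue(x, y):
    """Two-sided p-value via the Fisher z approximation (assumption:
    used here in place of the exact t-test performed by cor.test)."""
    n = len(x)
    r = pearson_r(x, y)
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)   # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def screen_markers(y_c, markers, alpha=0.01):
    """Return indices of markers whose correlation p-value is below alpha."""
    return [j for j, g in enumerate(markers)
            if correlation_pvalue(g, y_c) < alpha]

# Toy data: marker 0 drives the phenotype, marker 1 is unrelated to it.
n = 45
y_c = [i % 3 + 0.5 * ((i % 5) - 2) for i in range(n)]
markers = [[i % 3 for i in range(n)], [i // 15 for i in range(n)]]
kept = screen_markers(y_c, markers)
```

Only the retained indices would be carried forward into the reduced multi-locus model.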

Parameter Estimation Stage
The reduced multi-locus model is
$$y_C = W_C\alpha + G_C\beta + \varepsilon_C \tag{6}$$
where $y_C$ is the continuous-transformed phenotype vector of quantitative traits, $\alpha$ is the vector of fixed effects, $W_C$ is the corresponding design matrix for $\alpha$, $\varepsilon_C \sim MVN(0, \sigma^2 I_n)$, and $\sigma^2$ is the residual variance, all of which are the same as in model (5). $\beta = (\beta_1, \beta_2, \ldots, \beta_q)^T$ is a $q \times 1$ random effect vector of the $q$ SNPs selected in the previous step, and $G_C$ is an $n \times q$ genotype matrix of the $q$ markers after continuous transformation. Here, the polygenic background is not considered in model (6), because in the above two steps we have selected all potentially associated QTNs under the polygenic background. All parameters in model (6) are estimated by deshrinking ridge regression (DRR) [29].
The estimated effect and its variance for the DRR for the $k$th marker are, respectively, [...] where [...] which follows a chi-square distribution with one degree of freedom under the null model, $H_0: \beta_k = 0$. Bonferroni correction was used and the significance threshold was set to 0.05/q in the analysis [29].
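The significance test at this stage can be sketched as follows. The sketch assumes that the test statistic is the usual Wald ratio, the squared effect estimate divided by its variance; this is an assumption for illustration and does not reproduce the DRR estimators themselves. The function names are hypothetical. For one degree of freedom, the chi-square upper tail equals erfc(sqrt(w/2)), so no external library is needed.

```python
import math

def wald_p_value(effect, variance):
    """p-value of a Wald statistic effect^2 / variance against a
    chi-square distribution with one degree of freedom."""
    wald = effect ** 2 / variance
    # For 1 df: P(chi2 > w) = erfc(sqrt(w / 2)).
    return math.erfc(math.sqrt(wald / 2.0))

def significant_markers(effects, variances, alpha=0.05):
    """Indices of markers passing the Bonferroni threshold alpha / q,
    where q is the number of markers in the reduced model."""
    q = len(effects)
    threshold = alpha / q
    return [k for k in range(q)
            if wald_p_value(effects[k], variances[k]) < threshold]

# Toy estimates (made up): the first marker has a large effect relative
# to its variance, the second does not.
hits = significant_markers([0.5, 0.05], [0.01, 0.01])
```

Because q counts only the markers that survived screening, the threshold 0.05/q is far less stringent than a correction over all genotyped markers.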

Logistic Regression
A generalized linear model (GLM) [26] is a generalization of the general linear model, which can be applied to continuous, binary, and count data. R software (Version 4.2.1) provides the function glm() for fitting generalized linear models. When the parameter 'family' is set to 'binomial', it specifies a logistic regression model for a binary trait, which is equivalent to using '--logistic' in PLINK. Currently, the function glm() has not been designed for a multinomial distribution family, so we dichotomized each ordinal trait into a binary phenotype, coded as 1 or 0 depending on whether the ordinal value exceeds its mean. Given the computational cost, we constructed a single-locus model and performed logistic regression for each marker. The function glm() can be found at https://search.r-project.org/CRAN/refmans/rms/html/Glm.html (accessed on 15 May 2023).
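The dichotomization rule described above is simple enough to state in a few lines. The following Python sketch shows it on toy scores; the function name is hypothetical, and values equal to the mean are coded 0 because the text specifies "more than its mean".

```python
def dichotomize_by_mean(ordinal_values):
    """Code each ordinal score as 1 if it exceeds the sample mean, else 0."""
    mean = sum(ordinal_values) / len(ordinal_values)
    return [1 if v > mean else 0 for v in ordinal_values]

# Five-level toy scores; the mean is 3, so only 4 and 5 map to 1.
binary = dichotomize_by_mean([1, 2, 3, 4, 5])
```

This makes the information loss visible: five ordered levels collapse into two, and scores on the same side of the mean become indistinguishable.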

FarmCPU
FarmCPU, which is a multi-locus MLM method, was proposed by Liu et al. [16]. To completely eliminate the confounding between testing markers and kinship, FarmCPU divides a multi-locus mixed model into two parts, a fixed-effect model (FEM) and a random-effect model (REM), and uses them iteratively. An FEM tests markers one at a time, with multiple associated markers as covariates to control false positives. These associated markers are called pseudo QTNs. To avoid model over-fitting in FEMs, pseudo QTNs are estimated by an REM, where the pseudo QTNs are used to define kinship. The FEM and REM are used iteratively until the pseudo QTNs no longer change. FarmCPU is designed for continuous data, so we applied it by treating the binary or ordinal trait as a continuous phenotype for input. The method was implemented by the R package GAPIT (https://www.zzlab.net/GAPIT/ (accessed on 5 June 2023)).

FaST-LMM
The linear mixed model (LMM) tackles confounders by using measures of genetic similarity to capture the probabilities that pairs of individuals have causative alleles in common. For large-scale datasets, constructing a genetic similarity matrix from all SNPs takes too long and requires too much memory. To address this issue, FaST-LMM [9], which is a single-locus model, builds a realized relationship matrix from a sample of 200~2000 markers, which improves computational efficiency. However, this algorithm is designed for continuous quantitative traits, not for binary or ordinal traits. Therefore, we also applied FaST-LMM by treating the binary or ordinal trait as a continuous phenotype for input. The method was implemented by the R package GAPIT (https://www.zzlab.net/GAPIT/ (accessed on 12 June 2023)).

POLMM
POLMM [5] is a recently designed single-locus GWAS method for ordinal traits that uses a proportional odds logistic mixed model. POLMM uses penalized quasi-likelihood and average information restricted maximum likelihood algorithms to fit the mixed model efficiently, and uses saddle-point approximation to calculate the p-value. It can effectively control the type I error rate. The algorithm was implemented by the POLMM software package (Version 0.2.3) (https://github.com/WenjianBI/POLMM (accessed on 3 July 2023)).
In summary, among the above four comparison methods, logistic regression is used for binary traits, FarmCPU and FaST-LMM are designed for continuous traits, and POLMM can handle binary or ordinal traits. The objective of this study is to apply the FastRR algorithm directly to binary and ordinal traits and to evaluate its performance in the QTN detection of categorical phenotypes. Bonferroni correction was used in all comparison methods.

Simulation Data
We generated genotypes according to the minor allele frequency (MAF) in the interval (0.1, 0.5) under Hardy-Weinberg equilibrium. The simulation datasets contained 2000 individuals with 10,000 genetic variants. The total average and residual variance were both set at 10.0. On this basis, three simulation experiments were generated from the following mixed linear models.
Plants 2024, 13, 2520
For the first simulation experiment, a single QTN was placed at a fixed position on SNP 98 with a small effect size of 0.461. For the second simulation experiment, five fixed-position QTNs were placed on SNPs 98, 301, 540, 801, and 1000, with effects of 0.545, 0.862, 0.860, 1.079, and 1.209, respectively. For the third simulation experiment, we randomly selected 10 QTNs with an MAF > 0.3; the total heritability of the 10 QTNs was less than 19%, with a maximum of 3.16% and a minimum of 0.523% per QTN. Additionally, because the degree of polygenic background varies among species, three scenarios were considered for each simulation experiment in the mixed linear model: two times the polygenic background (2 k), five times the polygenic background (5 k), and ten times the polygenic background (10 k).
To investigate performance under different phenotypic distributions and numbers of hierarchical levels for each of the above nine combinations, we considered three representative combinations of phenotypic distribution and number of levels: a normal distribution with five hierarchical levels ranging from 1 to 5 (Figure S1A,D,G), a uniform distribution with five hierarchical levels ranging from 1 to 5 (Figure S1B,E,H), and a binomial distribution with two hierarchical levels, that is, binary phenotype data with values of 0 and 1 (Figure S1C,F,I). Each of the 27 simulation scenarios was repeated 100 times. All analyses were conducted on 96 CPU cores of an Intel Xeon Platinum 8168 M processor at 2.70 GHz with 314 GB RAM.
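The genotype-generation step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the simulation code used in the study: the MAF of each marker is drawn uniformly from the stated interval, genotypes are coded as minor-allele counts (0, 1, 2), and the function name and seed are hypothetical.

```python
import random

def simulate_genotypes(n_individuals, n_markers,
                       maf_low=0.1, maf_high=0.5, seed=None):
    """Simulate genotypes coded 0/1/2 under Hardy-Weinberg equilibrium.

    Each marker gets a minor allele frequency p drawn uniformly from
    [maf_low, maf_high]; each genotype is the sum of two independent
    allele draws, giving frequencies (1-p)^2, 2p(1-p), and p^2.
    """
    rng = random.Random(seed)
    mafs = [rng.uniform(maf_low, maf_high) for _ in range(n_markers)]
    genotypes = [
        [int(rng.random() < p) + int(rng.random() < p) for p in mafs]
        for _ in range(n_individuals)
    ]
    return mafs, genotypes

# A small example; the study itself used 2000 individuals and 10,000 variants.
mafs, genotypes = simulate_genotypes(50, 20, seed=1)
```

Drawing the two alleles independently is what enforces Hardy-Weinberg proportions; any dependence between them would model inbreeding or population structure instead.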
The statistical power for each QTN (power) was defined as the proportion of the 100 replicates in which the QTN exceeded the Bonferroni threshold. The average estimated effect (mean) was defined as the mean of the effect estimates over the 100 replicates for a QTN at a fixed location. The mean squared error (MSE) is the average of the squared differences between the estimated effect and the true value, and is used to evaluate the accuracy of effect estimates: a smaller MSE indicates a more accurate estimate, and a larger MSE a less accurate one. The receiver operating characteristic (ROC) curve shows the statistical power under different Type I error rates and is an important index for evaluating model performance. The heritability of each locus ($r^2$) was defined as the ratio of the genotypic variance of the QTN to the phenotypic variance.
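The two main evaluation metrics defined above can be written compactly. The Python sketch below assumes that each replicate yields one p-value and one effect estimate for the simulated QTN; how replicates in which the QTN was not detected enter the MSE is not stated in the text, so the sketch simply averages over whatever estimates are supplied. The function names and numbers are hypothetical.

```python
def statistical_power(p_values, threshold):
    """Fraction of replicates in which the QTN's p-value passes the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)

def mean_squared_error(estimates, true_effect):
    """Average squared difference between effect estimates and the true effect."""
    return sum((b - true_effect) ** 2 for b in estimates) / len(estimates)

# Toy example with four replicates and a Bonferroni threshold of 0.05 / 10,000.
power = statistical_power([1e-8, 0.2, 1e-7, 0.5], 0.05 / 10_000)
mse = mean_squared_error([1.0, 3.0], 2.0)
```

Evaluating statistical_power over a grid of thresholds gives the points of the power curve that the ROC analysis is based on.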
We excluded individuals with missing phenotypes, non-polymorphic SNPs, and SNPs with an MAF less than 0.10. Then, we calculated the population structure using the ADMIXTURE software (Version 1.3.0) [31], selected the best population structure matrix according to its cross-validation (CV) error (Figure S2), and inserted it into model (1) as a fixed-effect design matrix for each binary or ordinal trait.
The number of significant loci detected, the number of confirmed genes identified, and the computing time were used to compare the performance of each method. Bonferroni correction was used and the threshold was set to 0.05/m, where m is the number of markers involved in the real data analysis.
For each significant locus, the Arabidopsis Information Resource (TAIR, https://www.arabidopsis.org (accessed on 16 May 2024)) was used to mine known genes located within 20 kilobases (kb); known genes are those previously confirmed in the literature to be associated with the traits of interest.
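The 20 kb neighborhood search can be pictured with a small Python sketch. It assumes that a gene counts as nearby when any part of it overlaps the window of 20 kb on either side of the SNP; the text does not state whether distance was measured to the gene body or to its start, so this is an assumption. The gene names and coordinates are made up for illustration and are not TAIR records.

```python
def genes_near_snp(snp_chrom, snp_pos, genes, window=20_000):
    """Return names of genes overlapping [snp_pos - window, snp_pos + window].

    `genes` is a list of (name, chromosome, start, end) tuples.
    """
    lo, hi = snp_pos - window, snp_pos + window
    return [name for name, chrom, start, end in genes
            if chrom == snp_chrom and start <= hi and end >= lo]

# Hypothetical annotation: only geneA lies within 20 kb on chromosome 1.
toy_genes = [
    ("geneA", 1, 95_000, 97_000),
    ("geneB", 1, 130_000, 132_000),
    ("geneC", 2, 99_000, 101_000),
]
nearby = genes_near_snp(1, 100_000, toy_genes)
```

The candidate list produced this way still has to be checked against the literature, as described above, before a gene is counted as known.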

Statistical Power for QTN Detection
In the first simulation experiment, a QTN with a small effect of 0.461 was simulated at the 98th SNP; its heritability is detailed in Table S1. For ordinal phenotypes with five hierarchical levels generated from a normal distribution, the statistical power of FastRR was significantly higher than that of the four other methods (Figure 3A, Table S1). For example, under the polygenic background of 2 k, POLMM, FaST-LMM, and FarmCPU had similar power, at 52%, 51%, and 49%, respectively, which was significantly lower than the 88% power of FastRR. Logistic regression had the lowest power, at 25% (Figure 3A, Table S1). Under the polygenic backgrounds of 5 k and 10 k, FastRR had the highest power among all five methods, at 58% and 28%, respectively (Figure 3A, Table S1). As the influence of the polygenic background increased, the statistical power of all algorithms decreased, while FastRR always performed better (Figure 3A, Table S1). As shown in Figure 3B,C and Table S1, a similar trend was observed for a uniform distribution with five hierarchical levels and a binomial distribution with two hierarchical levels.
In the second simulation experiment, five QTNs with a heritability of 0.526~6.401% were simulated at fixed positions (Table S2). As shown in Figure S3 and Table S2, FastRR had the highest statistical power among the five algorithms under different polygenic backgrounds. For example, for a normal distribution, FastRR achieved power of 98%, 86%, and 52% under 2 k, 5 k, and 10 k, respectively (Figure S3A,D,G and Table S2). In contrast, FarmCPU, POLMM, and FaST-LMM had a power of 84%, 82%, and 76% under 2 k, while logistic regression achieved only 39% (Figure S3A,D,G, Table S2). FastRR showed similar trends under the uniform and binomial distributions. For the uniform distribution (Figure S3B,E,H and Table S2), FastRR had power of 95%, 73%, and 35% under 2 k, 5 k, and 10 k, respectively; under the binomial distribution (Figure S3C,F,I and Table S2), the corresponding values were 85%, 50%, and 28%. In particular, for the detection of small-effect loci with a heritability of less than 5% and effect values less than 1, FastRR had significantly higher power than the four other methods. For instance, under the binomial distribution, the power of FastRR for the 98th marker (QTN1, $r^2$ = 0.526~1.280%, true effect = 0.545) was at least twice that of the other methods (Figure S3C,F,I and Table S2).
In the third simulation experiment, 10 random-position QTNs of small effects were simulated with a total heritability of less than 19% and a heritability of 0.523~3.16% for each QTN. Figure S4 shows that FastRR had the highest power among the five approaches for the detection of small-effect QTNs. For 2 k of ordinal simulated data, the power of FastRR exceeded 90%, followed by FarmCPU, which was slightly lower than 90%; the power of POLMM, FaST-LMM, and logistic regression was below 85% in all cases (Figure S4A,B). Note that under the binomial distribution, FastRR outperformed the other algorithms by more than 10% in power at 2 k and 5 k (Figure S4C). In addition, the power of all methods decreased to varying degrees as the influence of the polygenic background increased, but FastRR, which corrects for the polygenic background, was minimally affected by changes in it (Figure S4), and its power was relatively stable.

ROC Curves at Different Levels of Significance
To compare detection power across different significance thresholds, we plotted ROC curves for the five methods in simulation experiments 1 and 2 (Figures 4 and S5). The ROC curves show that FastRR consistently outperforms the other methods at various significance levels and maintains excellent detection power at low Type I error levels. In particular, FastRR demonstrated significant advantages in identifying small-effect QTNs: in the first simulation experiment, the average statistical power of FastRR at significance levels ranging from $10^{-6}$ to $10^{-2}$ under 2 k was at least 48% greater than that of the second-best method under the normal distribution (Figure 4A), at least 38% greater under the uniform distribution (Figure 4B), and at least 42% greater under the binomial distribution (Figure 4C). In the second simulation experiment, focusing on the small-effect locus QTN1 ($r^2$ = 0.526~1.280%, true effect = 0.545, Table S2), FastRR again showed a significant advantage, with its power being approximately 40% higher than that of the second-best method under all three distributions, further confirming its ability to detect small-effect QTNs (Figure S5A).
In addition, the detection ability of all methods increased with effect size (Figure S5A-E). For example, for QTN4 ($r^2$ = 2.105~5.121%, true effect = 1.079, Table S2), all five methods achieved power greater than 85% when the significance level exceeded $10^{-7}$ under 2 k (Figure S5D). Similarly, for QTN5 ($r^2$ = 2.631~6.401%, true effect = 1.209, Table S2), nearly all methods achieved 100% power under 2 k, as evidenced by overlapping ROC curves (Figure S5E).
These results indicate that FastRR not only has excellent detection of large-effect QTNs, but also has distinct advantages in identifying small-effect QTNs.

Accuracy for Estimated QTN Effects
The mean and MSE were used to measure the accuracy of an estimated QTN effect, and SD was used to evaluate the stability of an estimated QTN effect. We evaluated the accuracy for the fixed positions, including the first and second simulation experiments, across all five methods, as listed in Tables S1 and S2. In the first simulation experiment, POLMM and logistic regression exhibited the closest mean estimates to true values, followed by FastRR, FarmCPU, and FaST-LMM (Table S1). In terms of MSE, POLMM, FarmCPU, and FaST-LMM had the lowest values, followed by FastRR and logistic regression (Table S1). Regarding stability, FastRR had SD values comparable to those of the other methods, indicating good stability (Table S1). In the second simulation experiment, the accuracy and stability of QTN effect estimates were comparable to those observed in the first simulation experiment (Table S2). These results indicate that FastRR has a robust effect estimation capability. Nevertheless, further work is warranted to improve the accuracy of its effect estimates relative to the other methods.

Computing Time
We compared the average computing time over 100 replicates in the three simulation experiments for the five methods, and found that FastRR was relatively fast and stable (Figure 5). In the first simulation experiment, FastRR was comparable in speed to POLMM and FarmCPU, all of which finished within 75 s. Logistic regression took slightly longer than these three methods, followed by FaST-LMM, which took 318.865 s, at least three times longer than FastRR (Figure 5A-C). The second simulation experiment showed a similar pattern, with FastRR and POLMM again the fastest, both completing within 75 s. FarmCPU took slightly longer, while logistic regression and FaST-LMM both took over 100 s, with FaST-LMM taking nearly five times as long as FastRR (Figure 5D-F). In the third simulation experiment, the average computing time of FastRR was consistent across different polygenic backgrounds and distributions, at less than 90 s, significantly lower than that of FarmCPU and FaST-LMM (Figure 5G-I).

Significant Loci Associated with Binary or Ordinal Traits
For the number of significant loci after Bonferroni correction, FastRR was significantly better than the other methods (Table 1). Specifically, for 11 out of 14 traits, including avrPphB, avrRpt2, avrB, Anthocyanin 10, 16, and 22, Leaf roll 16 and 22, Leaf serr 10 and 16, and Silique 22, FastRR detected more significant loci than the other methods (Table 1). FastRR identified a total of 479 significant loci associated with the 14 traits, followed by FarmCPU, POLMM, and FaST-LMM with 36, 15, and 14 loci, respectively, whereas logistic regression did not detect any significant loci (Table 1). Notably, the multi-locus approaches performed better than the single-locus approach in detecting small-effect loci with low heritability. As shown in Table S3, the significant loci associated with known genes that were detected by the multi-locus method FastRR had heritability ranging from 0.9% to 6.1%, and only two loci were higher than 5%; the multi-locus method FarmCPU had four loci with heritability lower than 5% and six loci with heritability higher than 5%. In contrast, the significant loci associated with known genes that were detected by the single-locus method FaST-LMM all had heritability above 5%. In general, FastRR demonstrates superiority in detecting small-effect loci with low heritability.
a: The optimal K value corresponding to the population structure matrix for each trait. Bold represents the largest number of significant loci or known genes among the five methods.

Known Genes around Significant Loci
By retrieving known genes from the TAIR website (https://www.arabidopsis.org/ (accessed on 16 May 2024)), we found that FastRR detected a total of 76 known genes near the significant loci, more than seven times as many as the second-ranked FarmCPU (Figure 6, Tables 2 and S3). FarmCPU detected a total of 10 known genes, POLMM and FaST-LMM detected 9 and 8 known genes, respectively, and logistic regression did not detect any known genes.
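Looking up genes near a significant SNP can be sketched as a simple interval query. This is only an illustration: the 20 kb window, the `Gene` record, and all gene IDs and coordinates below are hypothetical and are not taken from TAIR or from the study.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    gene_id: str
    chrom: int
    start: int  # bp
    end: int    # bp

def genes_near_snp(genes, chrom, pos, window=20_000):
    """Return IDs of genes on `chrom` overlapping [pos - window, pos + window]."""
    lo, hi = pos - window, pos + window
    return [g.gene_id for g in genes
            if g.chrom == chrom and g.start <= hi and g.end >= lo]

# Hypothetical annotation records; IDs and coordinates are made up.
toy_annotation = [
    Gene("geneA", 3, 1_000_000, 1_002_500),
    Gene("geneB", 3, 1_015_000, 1_018_000),
    Gene("geneC", 3, 1_200_000, 1_203_000),
    Gene("geneD", 1, 1_001_000, 1_004_000),
]

nearby = genes_near_snp(toy_annotation, chrom=3, pos=1_010_000)
```

The window size directly controls how many candidate genes are reported, so it should be chosen to reflect linkage disequilibrium decay in the population under study.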
The FastRR algorithm detected multiple gene clusters for the same trait (Tables 2 and S3). For example, for trait avrRpm1, it detected the adjacent genes AT3G59700, AT3G59730, AT3G59740, and AT3G59750 near the SNP at 22,058,868 bp on chromosome 3; for trait Leaf serr 22, it detected the adjacent genes AT2G19620, AT2G19690, and AT2G19730 near the SNP at 8,504,630 bp on chromosome 2; for trait avrPphB, it detected the adjacent genes AT3G26450, AT3G26460, and AT3G26470 near the SNP at 9,700,429 bp on chromosome 3. Overall, FastRR was able to detect clusters of genes controlling a target trait simultaneously.
Similarly, as shown in Figure 6 and Tables 2 and S3, known genes detected by FastRR were also identified by other methods. For example, for trait avrPphB, genes AT1G12210, AT1G12220, and AT1G12240, which are adjacent to SNPs located at 4,144,558 bp and 4,150,466 bp on chromosome 1, were also detected by FaST-LMM and POLMM. For trait avrRpt2, the gene AT4G26120, which is adjacent to SNPs located at 13,224,573 bp and 13,225,030 bp on chromosome 4, was also detected by FarmCPU, POLMM, and FaST-LMM (Tables 2 and S3). This cross-method agreement indicates that FastRR is reliable in mining genes.
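The cross-method agreement described above can be tabulated with a short sketch. Only the detections quoted in this paragraph are encoded, so the dictionary is an illustrative subset rather than the full content of Tables 2 and S3, and the helper name `shared_with` is hypothetical.

```python
# (trait, gene) -> methods stated in the text as detecting the gene.
# Illustrative subset only; see Tables 2 and S3 for the complete lists.
detections = {
    ("avrPphB", "AT1G12210"): {"FastRR", "FaST-LMM", "POLMM"},
    ("avrPphB", "AT1G12220"): {"FastRR", "FaST-LMM", "POLMM"},
    ("avrPphB", "AT1G12240"): {"FastRR", "FaST-LMM", "POLMM"},
    ("avrRpt2", "AT4G26120"): {"FastRR", "FarmCPU", "POLMM", "FaST-LMM"},
}

def shared_with(reference, detections, min_other=1):
    """Genes found by `reference` and by at least `min_other` other methods."""
    shared = {}
    for key, methods in detections.items():
        others = methods - {reference}
        if reference in methods and len(others) >= min_other:
            shared[key] = sorted(others)
    return shared

confirmed = shared_with("FastRR", detections)
by_three_others = shared_with("FastRR", detections, min_other=3)
```

Raising `min_other` gives a stricter notion of a gene being supported by several independent methods.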
Notably, FastRR can detect pleiotropic genes. For example, on chromosome 1, SNPs at 26,159,219 bp and 26,192,702 bp for the known gene AT1G69588 were significantly associated with Leaf serr 22 and Leaf roll 10, respectively (Tables 2 and S3). Known genes AT3G59700, AT3G59730, AT3G59740, and AT3G59750 are located near the SNP on chromosome 3 at 22,058,868 bp and are associated with avrB and avrRpm1 (Tables 2 and S3). Known genes AT3G06980 and AT3G07040 are located near the SNPs on chromosome 3 at 2,181,673 bp and 2,225,659 bp, respectively, which are also associated with avrB and avrRpm1 (Tables 2 and S3). These findings demonstrate the ability of FastRR to detect pleiotropic genes.
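As a minimal sketch, pleiotropic genes can be flagged by inverting a trait-to-gene mapping and keeping genes linked to more than one trait. The mapping below contains only the examples quoted in this paragraph, and the function name is hypothetical.

```python
from collections import defaultdict

# Genes shared by avrB and avrRpm1 according to the paragraph above.
shared_avr = ["AT3G59700", "AT3G59730", "AT3G59740", "AT3G59750",
              "AT3G06980", "AT3G07040"]

# Trait -> known genes (illustrative subset, not the full Table S3).
trait_genes = {
    "Leaf serr 22": ["AT1G69588"],
    "Leaf roll 10": ["AT1G69588"],
    "avrB": list(shared_avr),
    "avrRpm1": list(shared_avr),
}

def pleiotropic_genes(trait_genes):
    """Map each gene associated with two or more traits to its sorted trait list."""
    gene_traits = defaultdict(set)
    for trait, genes in trait_genes.items():
        for gene in genes:
            gene_traits[gene].add(trait)
    return {g: sorted(t) for g, t in gene_traits.items() if len(t) > 1}

pleiotropic = pleiotropic_genes(trait_genes)
```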

Computing Time
For all traits, FastRR was computationally much faster than the other methods (Table 3). The average computing time of FastRR was about 60 s, whereas logistic regression, FarmCPU, FaST-LMM, and POLMM all took at least three times as long. Among them, logistic regression was the most computationally intensive, taking more than 10 times as long as FastRR. For example, for the trait avrRpt2, FastRR took about 60 s to run, while the other four methods took more than 200 s, and logistic regression took as long as 874.381 s (Table 3). FastRR therefore has a clear advantage in computational speed.

Most analyses of binary or ordinal traits rely on logistic or polychotomous logistic models that employ link functions to convert probabilistic models into linear models, which increases the computational expense. In addition, methods designed for binary or continuous traits are commonly and inappropriately used to analyze ordinal phenotypes, which causes information loss. FastRR offers a distinct advantage in that it does not require link functions and does not lose useful information, which allows direct application to binary or ordinal phenotype data. By correcting for polygenic background and population structure and employing a matrix transformation, FastRR converts binary or ordinal traits into continuous traits without link functions (Equations (3)-(5)). Consequently, this approach avoids both the computational complexity caused by link functions and the information loss caused by inappropriate methods.
For ordinal traits, we compared FastRR with two commonly used strategies. One treats the ordinal phenotype as a continuous trait and then uses FarmCPU and FaST-LMM; the other dichotomizes the ordinal phenotype and then uses logistic regression. The former violates the nature of the ordinal phenotype, and the latter can lose useful phenotypic information and statistical power [5,6,13]. We also compared FastRR with the ordinal method POLMM. Both the simulation studies with five hierarchical levels (Figures 3A,B and S3A,B,D,E,G,H, and Tables S1 and S2) and the real data analysis with five and ten hierarchical levels (Tables 1, 2 and S3) showed that FastRR, which avoids the above strategies by using the continuous-transformation stage in model (5), has strong power to detect QTN and mine genes associated with the ordinal traits of interest.
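The information loss caused by dichotomizing an ordinal phenotype can be made concrete with a small sketch. The scores and the cutoff below are hypothetical toy values, and Shannon entropy is used here only as one convenient way to quantify the loss; it is not a measure used in the study.

```python
import math
from collections import Counter

def dichotomize(scores, cutoff):
    """Collapse ordinal scores to 0/1: 1 if score >= cutoff, otherwise 0."""
    return [1 if s >= cutoff else 0 for s in scores]

def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical category distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical infection scores on a 0-4 ordinal scale (toy values).
ordinal = [0, 1, 1, 2, 3, 3, 4, 4]
binary = dichotomize(ordinal, cutoff=2)

levels_before = len(set(ordinal))   # distinct categories before collapsing
levels_after = len(set(binary))     # distinct categories after collapsing
h_before = entropy_bits(ordinal)
h_after = entropy_bits(binary)
```

After dichotomizing, individuals scored 2 and 4 are indistinguishable, which is the kind of lost phenotypic resolution the paragraph above refers to.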
For binary traits, we directly compared FastRR with binary methods, including logistic regression and POLMM. We also compared FastRR with the inappropriate strategy of treating the binary phenotype as a continuous trait and then using FarmCPU and FaST-LMM. The simulation studies of the binomial distribution (Figures 3C and S3C,F,I, Tables S1 and S2) and the real data analysis with two hierarchical levels (Tables 1, 2 and S3) show that FastRR remains reliable and valid for binary phenotypes.

Extensive Applicability of the FastRR Method
Although the phenotype data were assumed to follow a normal distribution in model (1), the simulation experiments comprehensively validated the extensive applicability of FastRR. First, phenotype data with varying distributions and hierarchical levels were used to generate binary or ordinal traits, including the normal, uniform, and binomial distributions, with 2 or 5 hierarchical levels. Second, three sets of simulation studies were conducted to examine different QTNs and heritability settings: a single fixed-position QTN with a heritability of 0.407~1.12%, multiple fixed-position QTNs with heritabilities of 0.526~6.401%, and multiple random-position QTNs with a total heritability of 7.85~18.96%. Finally, each simulation incorporated varying degrees of polygenic background, with 2, 5, and 10 times the polygenic background (Figure S1, Tables S1 and S2). In total, 27 simulation scenarios were conducted. For the Arabidopsis real data, 14 representative traits were also selected, including ten binary traits and four ordinal traits, with 2, 5, or 10 hierarchical levels (Figure 2, Table 1). These settings and selections demonstrate the wide range of applications of FastRR and its robust detection capability across a variety of distributions beyond the normal, including binary, multi-categorical, and other continuous distributions.
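The count of 27 scenarios is consistent with a full cross of three experiments, three phenotype distributions, and three polygenic backgrounds. The sketch below enumerates that grid under this assumption; the dictionary keys and label strings are illustrative names, while the pairing of distributions with hierarchical levels follows the figure captions (normal and uniform with 5 levels, binomial with 2).

```python
from itertools import product

experiments = [
    "single fixed-position QTN",
    "multiple fixed-position QTNs",
    "multiple random-position QTNs",
]
# Distribution -> number of hierarchical levels of the resulting trait.
distributions = {"normal": 5, "uniform": 5, "binomial": 2}
polygenic_backgrounds = [2, 5, 10]  # multiples of the polygenic background

# Assumed full factorial design: every combination is one scenario.
scenarios = [
    {
        "experiment": exp,
        "distribution": dist,
        "levels": distributions[dist],
        "polygenic_background": bg,
    }
    for exp, dist, bg in product(experiments, distributions, polygenic_backgrounds)
]

n_scenarios = len(scenarios)
n_binary = sum(s["levels"] == 2 for s in scenarios)
```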

Prospects of the FastRR Method
FastRR identified a series of true QTNs in simulation studies (Figures 3, S3 and S4; Tables S1 and S2) and known genes in real data analysis (Figure 6, Tables 2 and S3). Moreover, QTNs or genes identified by multiple methods are deemed reliable [14]. As shown in Figure 6 and Tables 2 and S3, more of the known genes detected by FastRR were also identified by other methods, indicating that it is more reliable in mining genes.
As a multi-locus method that utilizes dimension reduction through variable selection, FastRR was compared with another multi-locus method, FarmCPU, and three single-locus methods, namely logistic regression, FaST-LMM, and POLMM. The results revealed that the multi-locus model considers the potential relationships between neighboring loci and explains the genetic basis of complex quantitative traits in plants better than the single-locus model, which is consistent with the previous literature [14]. FastRR, for example, shows excellent performance: compared to FaST-LMM and POLMM, it increases power by at least 35% under 2 k in the first simulation experiment (Figure 3A, Table S1). Moreover, detecting small-effect loci with low heritability has long been a challenge in the analysis of complex categorical quantitative traits. The results reveal that for true QTN with a heritability of less than 6.401% and an effect of less than 1, FastRR has markedly higher power than the other methods (Figures 4 and S5, and Tables S1 and S2). For instance, in the second simulation experiment, the power of FastRR for QTN1, which has the lowest heritability of the five QTNs, exceeded that of the other methods by at least 14%, with the advantage becoming more pronounced as the genetic background increased (Figure S3D-F, Table S2). Furthermore, in the real data, the heritability of significant loci detected by FastRR and associated with known genes was below 5% in all but two cases (Tables 2 and S3). In addition, FastRR, as a multi-stage algorithm, reduces the dimensionality of the raw large-scale data in the variable reduction stage and then performs the multi-locus analysis, which sharply improves the computational efficiency [22]. In the Arabidopsis real data analysis, FastRR was the fastest of the five methods (Table 3). We also evaluated three other multi-locus approaches, EBLASSO-NEG and EBLASSO-NE [23] for binary traits and the method of Feng et al. [24] for ordinal traits. Unfortunately, none of these analyses could be completed because of their low computational efficiency on large-scale data. In general, the multi-locus model with variable reduction in the FastRR algorithm can be used to detect the potential relationships between neighboring loci and to mine small-effect loci or genes with rapid computation, and it can be extended to the analysis of large-scale data and multiple phenotypes. Allowing each individual phenotype's model to share information can lead to better results and increased power [32]. This will be the subject of our future research.
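The power comparisons quoted in this subsection can be read with the following minimal sketch, which assumes that empirical power is the fraction of simulation replicates in which the true QTN is declared significant. That definition, the function names, and the detection outcomes below are assumptions for illustration only; the toy values are not results from the study.

```python
def empirical_power(detected_flags):
    """Fraction of simulation replicates in which the true QTN was detected."""
    flags = list(detected_flags)
    if not flags:
        raise ValueError("need at least one replicate")
    return sum(bool(f) for f in flags) / len(flags)

def power_gain_points(power_a, power_b):
    """Difference in power between two methods, in percentage points."""
    return 100.0 * (power_a - power_b)

# Hypothetical detection outcomes over 100 replicates (toy values only).
method_a = [True] * 70 + [False] * 30
method_b = [True] * 50 + [False] * 50

power_a = empirical_power(method_a)
power_b = empirical_power(method_b)
gain = power_gain_points(power_a, power_b)
```

Reporting the gain in percentage points avoids ambiguity between absolute and relative improvements when two methods are compared.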

Conclusions
In this study, the recently proposed FastRR algorithm for continuous traits was directly applied to binary and ordinal traits in genome-wide association studies. It converts the binary or ordinal trait to a continuous phenotype by polygenic background correction and a special matrix transformation, rather than by introducing link functions or applying methods designed for other trait types. Compared with four other approaches for continuous or categorical data, FastRR showed valid and superior performance in terms of QTN detection power, accuracy, and computing time in a series of simulation experiments involving different QTN settings, heritabilities, phenotypic distributions, hierarchical levels, and polygenic backgrounds. This superiority is particularly evident in the detection of small-effect loci. In the Arabidopsis real data analysis, FastRR identified 479 significant loci and 76 known genes associated with ten binary and four ordinal traits. In summary, FastRR provides an efficient GWAS tool for continuous, binary, and ordinal phenotypes.

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/plants13172520/s1, Figure S1: various phenotypic distributions and hierarchical levels in three simulation experiments. From left to right, each column illustrates a different distribution. From top to bottom, each row represents the first (A-C), second (D-F), and third (G-I) simulation experiment, respectively. Figure S2: the relationship between CV error and K values for group structure of 14 binary or ordinal traits in the Arabidopsis real dataset. Figure S3: statistical power for 5 fixed-position QTNs detected by five methods in the second simulation experiment. From left to right, each column illustrates a different distribution. From top to bottom, each row represents 2 (A-C), 5 (D-F), and 10 (G-I) times the polygenic background, respectively. Figure S4: statistical power for 10 random-position QTNs detected by five methods in the third simulation experiment under (A) normal distribution with 5 hierarchical levels, (B) uniform distribution with 5 hierarchical levels, and (C) binomial distribution with 2 hierarchical levels. Figure S5: ROC curves for five fixed-position QTNs (A: QTN1; B: QTN2; C: QTN3; D: QTN4; E: QTN5) for five methods in the second simulation experiment. In each figure, from left to right, each column illustrates a different distribution. From top to bottom, each row represents 2, 5, and 10 times the polygenic background, respectively. Table S1: statistical power and accuracy in the detection of QTN using five methods in the first simulation experiment. Table S2: statistical power and accuracy in the detection of QTN using five methods in the second simulation experiment. Table S3: significant QTNs and their known genes detected using five methods for 14 binary or ordinal traits in the Arabidopsis real dataset.

Figure 1. A flow chart of the FastRR method.


Figure 3. The statistical power for QTN detected by five methods in the first simulation experiment under (A) a normal distribution with 5 hierarchical levels, (B) a uniform distribution with 5 hierarchical levels, and (C) a binomial distribution with 2 hierarchical levels.


Figure 4. ROC curves for the five methods in the first simulation experiment. From top to bottom, each row represents 2 (A-C), 5 (D-F), and 10 (G-I) times the polygenic background, respectively.


Figure 5. The average computing time using five methods in three simulation experiments. From top to bottom, each row represents the first (A-C), second (D-F), and third (G-I) simulation experiment, respectively.


Figure 6. A heatmap of known genes identified by five methods for fourteen binary or ordinal traits in the Arabidopsis real dataset.


Table 1. A comparison of the number of significant loci/known genes detected by five methods for fourteen binary or ordinal traits in the Arabidopsis real dataset.

Table 2. Known genes around significant loci identified by five approaches for fourteen binary or ordinal traits in the Arabidopsis real dataset.


Table 3. A comparison of computation times (in seconds) for five methods applied to fourteen binary and ordinal traits in the Arabidopsis real dataset. Bold represents the fastest computing time among the five methods. Note: